We examine perfect information stochastic mean-payoff games, a class of games containing as
special sub-classes the usual mean-payoff games and parity games. We show that deterministic
memoryless strategies that are optimal for discounted games with state-dependent discount factors
close to 1 are optimal for priority mean-payoff games, establishing a strong link between these two
classes.

1 Introduction 

One of the recurring themes in the theory of stochastic games is the interplay between discounted games 
and mean-payoff games. This culminates in the seminal paper of Mertens and Neyman [12] showing
that mean-payoff games have a value and that this value is the limit of the values of discounted games when
the discount factor tends to 1. Note however that optimal strategies in both games are very different. As
shown by Shapley [13], discounted stochastic games admit memoryless optimal strategies. On the other
hand, mean-payoff games do not have optimal strategies, they have only $\varepsilon$-optimal strategies, and to play
optimally players need an unbounded memory.

The connections between discounted and mean-payoff games become much tighter when we consider
perfect information stochastic games (games where players play in turns). As discovered by Blackwell [3],
if the discount factor is close to 1 then optimal memoryless deterministic strategies in discounted
games are also optimal for mean-payoff games (but not the other way round). Thus both games are related
not only by their values but also through their optimal strategies. Blackwell's result extends easily
to two-player perfect information stochastic games.

What happens if instead of mean-payoff games we consider parity games, a class of games more
directly relevant to computer science [9]? In particular, are parity games related to discounted games?

It is well known that deterministic mean-payoff games and parity games are related, see Q. The first 
insight that there is some link between parity games and discounted games is due to de Alfaro et al. [1].
It turns out that parity games are related to multi-discounted games with multiple discount factors that 
depend on the state. This should be compared with discounted games with a unique, state independent, 
discount factor which are used in the study of mean-payoff games. 

As in the classical theory of stochastic games, we examine what happens when the discount factors
tend to 1; the idea is that in the limit we want to obtain parity games. Note that if we have several
state-dependent discount factors $\lambda_1, \ldots, \lambda_k$ then there are two possibilities to approach 1:

• we can study the iterated limit $\lim_{\lambda_1 \to 1} \cdots \lim_{\lambda_k \to 1}$, where the discount factors tend to 1 one after another
(i.e. first we go to 1 with the discount factor $\lambda_k$ associated with some group of states, when the
limit is reached then we go to 1 with the next discount factor $\lambda_{k-1}$, etc.),

• another possibility is to examine a simultaneous limit when all factors go to 1 at the same time but
with different rates; this will be made precise in Section 4.

The first approach is easier to handle than the second but it leads to weaker results; in particular we lose
the links between optimal strategies in discounted games and optimal strategies in parity games.

We began our examinations of relations between discounted and parity games in J31 |5l where we 
limited ourselves to deterministic games. Already this preliminary work revealed that the natural framework
for such a study goes far beyond parity games. In fact parity games are related to a very particular
restricted class of discounted games, and when we examine all multi-discounted games then at the limit
we obtain a new natural class of games: priority mean-payoff games. This new class contains the usual
mean-payoff games and parity games as special subclasses.

The next natural step is to try to extend the results that hold for deterministic games to perfect
information stochastic games. In two papers [6, 7] we obtained some partial results in this direction. In
[7] we considered a class of games that contains parity games but does not contain mean-payoff games.
We showed that such games can be seen as an iterated limit of discounted games, a limit in a very
strong sense: not only does the value of the discounted games converge to the value of the parity game, but
optimal strategies in one class are also inherited by the class of games obtained in the limit. But these
results are not satisfactory for two reasons. First, the class of games for which we were able to carry out our study
is too restrictive: it involves some technical restrictions on discounted games which are natural
for parity games, but not so natural for discounted games. The second problem comes from the fact that
[7] uses the iterated limit of discount factors and not the more interesting simultaneous limit.

In the second paper [6] we considered priority mean-payoff games in full generality, with no artificial
restrictions, and we examined directly the limit with the discount factors tending to 1 with different rates
rather than the iterated limit. However [6] deals only with one-player games and it examines only game
values; the paper does not provide any relation between optimal strategies in multi-discounted games and
optimal strategies in the priority mean-payoff games in the limit.

In the present paper we remove all restrictions imposed in [6, 7]. We consider the full class of perfect
information stochastic priority mean-payoff games and we show that such games are a limit of discounted
games with discount factors tending to 1 with the rates depending on the priority. Not only does the value
of the discounted game equal the value of the priority mean-payoff game at the limit, but optimal
deterministic memoryless strategies in discounted games also turn out to be optimal in the corresponding
priority mean-payoff game.

The interest in such a result is threefold. 

First we think that establishing a very strong link between two apparently different classes of games 
has its own intrinsic interest. 

Discounted games were thoroughly studied in the past and our result shows that algorithms for such
games can, in principle, be used to solve parity games (admittedly everything depends on how close to 1 the
discount factors must be for the two types of games to become close enough, and this remains open).

Another point concerns the stability of solutions (optimal strategies and game values) under small
perturbations. When we examine stochastic games, a natural question is where the transition
probabilities come from. If they come from an observation then the values of transition probabilities are
not exact. On the other hand algorithms for stochastic games use only rational transition probabilities,
thus even if we know the exact probabilities we replace them by close rational values. What is the impact
of such approximations on solutions? Are optimal strategies stable under small perturbations? Usually we
tacitly assume that this is the case but it would be better to be sure. Since Blackwell-optimal strategies
studied in Section 4 are stable under small perturbations of discount factors (because they do not depend
on the discount factor), this adds some credibility to the claim that Blackwell optimal strategies are stable
for parity games.

And the last point. Blackwell invented Blackwell optimality because he was not satisfied with the
notion of optimal strategies for mean-payoff Markov decision processes. However the same can be said
about parity games; we defer examples to the final section.

The paper is organized as follows. In Section 2 we introduce stochastic games in general and we define
the notions of value and optimal strategies. In Section 3 we examine discounted games. The main result in
this section shows that if discount factors are close to 1 then optimal strategies stabilize (Blackwell optimality).
In Section 5 we introduce the class of priority mean-payoff games; this is the principal class
of games examined in this paper. Parity games and mean-payoff games are just very special subclasses
of this class. In Section 6 we prove the main result of the paper stating that deterministic memoryless
strategies optimal for discounted games for discount factors sufficiently close to 1 are optimal in derived
priority mean-payoff games.

2 Stochastic Games with Perfect Information 

Notation. In this paper $\mathbb{N}$ stands for the set of positive integers, $\mathbb{N}_0 = \mathbb{N} \cup \{0\}$, and $\mathbb{R}_+$ is the set of
positive real numbers.

For each finite set $X$, $\mathcal{D}(X)$ is the set of probability distributions over $X$, i.e. it is the set of mappings
$p : X \to [0, 1]$ such that $\sum_{x \in X} p(x) = 1$. The support of $p \in \mathcal{D}(X)$ is the set $\{x \in X : p(x) > 0\}$.

2.1 Games and Arenas 

Two players Max and Min are playing an infinite game on an arena. An arena is a tuple

$\mathcal{A} = (S, S_{\mathrm{Max}}, S_{\mathrm{Min}}, A, (A(s))_{s \in S}, \delta)$,

where a finite set of states $S$ is partitioned into two sets, the set $S_{\mathrm{Max}}$ of states controlled by player Max and
the set $S_{\mathrm{Min}}$ of states controlled by player Min. For each state $s \in S$ there is a non-empty finite set $A(s)$
of actions available in $s$, and $A = \bigcup_{s \in S} A(s)$. Players Max and Min play an infinite game on $\mathcal{A}$. If at stage
$i \in \mathbb{N}_0$ the game is in a state $s_i \in S$ then the player controlling $s_i$ chooses an action from $A(s_i)$ and a new
state $s_{i+1}$ is chosen with probability specified by the transition mapping $\delta$. The transition mapping $\delta$ maps
each pair $(s, a)$, where $s \in S$ and $a \in A(s)$, to an element of $\mathcal{D}(S)$. Intuitively, if in a state $s$ an action
$a$ is executed then $\delta(s,a)(t)$ gives the probability that at the next stage the game is in state $t$. To simplify
the notation we shall write $\delta(s,a,t)$ rather than $\delta(s,a)(t)$.

Throughout the paper we assume that all arenas are finite, i.e. the sets of states and actions are finite. 

An arena is said to be a one-player arena controlled by player Max if, for every state $s$ controlled by
Min, the set $A(s)$ is a singleton (in particular if all states are controlled by Max then $\mathcal{A}$ is a one-player
arena controlled by Max). One-player arenas controlled by player Min are defined similarly.
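
As a purely illustrative sketch (not part of the formal development), the arena tuple can be mirrored by a small Python data structure. The class name `Arena`, its field names and the two-state `example` are assumptions made here for illustration only; exact fractions are used so that each distribution $\delta(s, a)$ can be checked to sum to 1.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Arena:
    """Finite arena: states split between Max and Min, actions, transitions."""
    max_states: frozenset   # S_Max
    min_states: frozenset   # S_Min
    actions: dict           # state -> non-empty tuple of actions A(s)
    delta: dict             # (state, action) -> {next_state: probability}

    @property
    def states(self):
        return self.max_states | self.min_states

    def validate(self):
        """Check the conditions stated in the definition of an arena."""
        if self.max_states & self.min_states:
            raise ValueError("S_Max and S_Min must be disjoint")
        for s in self.states:
            if not self.actions.get(s):
                raise ValueError(f"A({s!r}) must be non-empty")
            for a in self.actions[s]:
                dist = self.delta[(s, a)]
                if not set(dist) <= self.states:
                    raise ValueError(f"delta({s!r}, {a!r}) leaves the state set")
                if any(p < 0 for p in dist.values()) or sum(dist.values()) != 1:
                    raise ValueError(f"delta({s!r}, {a!r}) is not a distribution")
        return True

    def is_one_player_for_max(self):
        """True if every state controlled by Min has exactly one action."""
        return all(len(self.actions[s]) == 1 for s in self.min_states)


# Hypothetical example: Max controls "s", Min controls "t".
example = Arena(
    max_states=frozenset({"s"}),
    min_states=frozenset({"t"}),
    actions={"s": ("a", "b"), "t": ("c",)},
    delta={
        ("s", "a"): {"s": Fraction(1, 2), "t": Fraction(1, 2)},
        ("s", "b"): {"t": Fraction(1)},
        ("t", "c"): {"s": Fraction(1)},
    },
)
```

In this example Min has a single action in its only state, so the arena is a one-player arena controlled by Max in the sense defined above.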

A finite (resp. infinite) play in the arena $\mathcal{A}$ is a non-empty finite (resp. infinite) sequence of states
and actions in $(SA)^* S$ (resp. in $(SA)^\omega$). In the sequel "play" without any attribute will be used as a
synonym of "infinite play".

2.2 Payoffs 

After an infinite play player Max receives a payoff from player Min. The objectives of the players are 
opposite, the goal of Max is to maximize the payoff while player Min wants to minimize the payoff. 

The payoff can be computed in various ways. For example in a mean-payoff game each state is
labeled with a real number called the reward and after an infinite play the payoff of player Max is the
limit of mean values of the sequence of rewards. In a parity game, each state is labeled with an integer
called a priority and player Max receives payoff 0 or 1 depending on the parity of the highest priority seen
infinitely often. In both examples, the way the payoffs are computed is independent of the transition
rules of the game (the arena); it depends only on the play.
Thus formally a payoff function is a mapping

$u : (SA)^\omega \to \mathbb{R}$

from infinite plays to real numbers.

A game is a couple $\Gamma = (\mathcal{A}, u)$ made of an arena and a payoff function. Usually we consider not
a particular game but rather a class of games. In this case arenas are endowed with some additional 
structure, usually some labeling of states or actions (for example rewards as in mean-payoff games or 
priorities as in parity games) and this labeling is used to define the payoff for games in the given class. 

2.3 Strategies 

Playing a game the players use strategies. A strategy for player Max is a mapping $\sigma : (SA)^* S_{\mathrm{Max}} \to \mathcal{D}(A)$
such that for every finite play $p = s_0 a_0 s_1 a_1 \cdots s_n$ with $s_n \in S_{\mathrm{Max}}$, the support of $\sigma(p)$ is a subset of
the actions available in $s_n$, i.e. for all $a \in A$, if $\sigma(p)(a) > 0$ then $a \in A(s_n)$.
Strategies for player Min are defined similarly and denoted $\tau$.

Certain types of strategies are of particular interest. A strategy is deterministic if it chooses actions 
in a deterministic way, and it is memoryless if it does not have any memory, i.e. choices depend only on 
the current state of the game, and not on the past history. Formally: 

Definition 1. A strategy $\sigma$ of player $i \in \{\mathrm{Min}, \mathrm{Max}\}$ is said to be:

• deterministic if, $\forall p \in (SA)^* S_i$ and $\forall a \in A$, if $\sigma(p)(a) > 0$ then $\sigma(p)(a) = 1$,

• memoryless if, $\forall t \in S_i$ and $\forall p \in (SA)^*$, $\sigma(pt) = \sigma(t)$.

For any finite play $p \in (SA)^* S$ and an action $a \in A$ we define the cones $\mathcal{O}(p)$ and $\mathcal{O}(pa)$ as the sets
consisting of all infinite plays with prefix $p$ and $pa$ respectively.

In the sequel we assume that the set of infinite plays $(SA)^\omega$ is equipped with the $\sigma$-field $\mathcal{B}((SA)^\omega)$
generated by the collection of all cones $\mathcal{O}(p)$ and $\mathcal{O}(pa)$. Elements of this $\sigma$-field are called events.
Moreover, when there is no risk of confusion, the events $\mathcal{O}(p)$ and $\mathcal{O}(pa)$ will be denoted simply $p$ and
$pa$.

Suppose that players Max and Min are playing according to strategies $\sigma$ and $\tau$. Then after a finite
play $s_0 a_1 \ldots s_n$ the probability of choosing an action $a_{n+1}$ is either $\sigma(s_0 a_1 \ldots s_n)(a_{n+1})$ or $\tau(s_0 a_1 \ldots s_n)(a_{n+1})$
depending on whether $s_n$ belongs to $S_{\mathrm{Max}}$ or to $S_{\mathrm{Min}}$. Fixing the initial state $s \in S$, these probabilities and
the transition probability $\delta$ yield the following probabilities:

$\mathbb{P}_s^{\sigma,\tau}(s_0) = \begin{cases} 1 & \text{if } s_0 = s \\ 0 & \text{otherwise} \end{cases}$ (1)

is the probability of the cone $\mathcal{O}(s_0)$,

$\mathbb{P}_s^{\sigma,\tau}(s_0 a_1 \ldots s_n a_{n+1} \mid s_0 a_1 \ldots s_n) = \begin{cases} \sigma(s_0 a_1 \ldots s_n)(a_{n+1}) & \text{if } s_n \in S_{\mathrm{Max}} \\ \tau(s_0 a_1 \ldots s_n)(a_{n+1}) & \text{if } s_n \in S_{\mathrm{Min}} \end{cases}$ (2)

is the conditional probability of $\mathcal{O}(s_0 a_1 \ldots s_n a_{n+1})$ given $\mathcal{O}(s_0 a_1 \ldots s_n)$, and

$\mathbb{P}_s^{\sigma,\tau}(s_0 a_1 \ldots s_n a_{n+1} s_{n+1} \mid s_0 a_1 \ldots s_n a_{n+1}) = \delta(s_n, a_{n+1}, s_{n+1})$ (3)

is the conditional probability of the cone $\mathcal{O}(s_0 a_1 \ldots s_n a_{n+1} s_{n+1})$ given the cone $\mathcal{O}(s_0 a_1 \ldots s_n a_{n+1})$.

Ionescu Tulcea's theorem [14] implies that there exists a unique probability measure $\mathbb{P}_s^{\sigma,\tau}$ on the
measurable space $((SA)^\omega, \mathcal{B}((SA)^\omega))$ satisfying (1), (2) and (3).
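
To make the construction concrete, here is a minimal sketch of how the probability of a cone can be computed by multiplying the factors given by (1), (2) and (3). It is restricted, as a simplifying assumption, to memoryless strategies represented as dictionaries; the function name `cone_probability` and the small two-state data set are hypothetical and serve only as an illustration.

```python
from fractions import Fraction


def cone_probability(initial, play, sigma, tau, max_states, delta):
    """Probability of the cone O(play) when the game starts in `initial`.

    play       -- finite play [s0, a1, s1, a2, ..., ] alternating states/actions
    sigma, tau -- memoryless strategies: state -> {action: probability}
    delta      -- (state, action) -> {next_state: probability}
    """
    if play[0] != initial:
        return Fraction(0)                      # equation (1)
    prob = Fraction(1)
    for k in range(1, len(play)):
        if k % 2 == 1:
            # play[k] is an action chosen in state play[k-1]: equation (2)
            state, action = play[k - 1], play[k]
            strategy = sigma if state in max_states else tau
            prob *= strategy[state].get(action, Fraction(0))
        else:
            # play[k] is the state reached after play[k-1]: equation (3)
            state, action, nxt = play[k - 2], play[k - 1], play[k]
            prob *= delta[(state, action)].get(nxt, Fraction(0))
    return prob


# Hypothetical arena: Max controls "s", Min controls "t".
MAX_STATES = {"s"}
DELTA = {
    ("s", "a"): {"s": Fraction(1, 2), "t": Fraction(1, 2)},
    ("s", "b"): {"t": Fraction(1)},
    ("t", "c"): {"s": Fraction(1)},
}
SIGMA = {"s": {"a": Fraction(1, 2), "b": Fraction(1, 2)}}  # randomized, memoryless
TAU = {"t": {"c": Fraction(1)}}
```

For strategies with memory the lookup `strategy[state]` would have to be replaced by a function of the whole prefix, exactly as in equation (2).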

2.4 Optimal strategies 

Let $\mathcal{A} = (S, S_{\mathrm{Max}}, S_{\mathrm{Min}}, A, (A(s))_{s \in S}, \delta)$ be an arena. In the sequel we assume that all payoff mappings
$u : (SA)^\omega \to \mathbb{R}$ are bounded and measurable (for measurability we assume that $(SA)^\omega$ is equipped with
the $\sigma$-field described in the preceding section and $\mathbb{R}$ is equipped with the $\sigma$-field $\mathcal{B}(\mathbb{R})$ of Borel sets).

Given an initial state $s$ and strategies $\sigma$ and $\tau$ of Max and Min, the expected value of the payoff $u$
under $\mathbb{P}_s^{\sigma,\tau}$ is denoted $\mathbb{E}_s^{\sigma,\tau}[u]$.

A strategy $\sigma^\sharp$ for player Max is said to be optimal in a game $(\mathcal{A}, u)$ if for every state $s$,

$\inf_{\tau} \mathbb{E}_s^{\sigma^\sharp,\tau}[u] = \sup_{\sigma} \inf_{\tau} \mathbb{E}_s^{\sigma,\tau}[u]$.

Dually a strategy $\tau^\sharp$ of player Min is optimal if $\sup_{\sigma} \mathbb{E}_s^{\sigma,\tau^\sharp}[u] = \inf_{\tau} \sup_{\sigma} \mathbb{E}_s^{\sigma,\tau}[u]$, for each state $s$.
In general,

$\underline{\mathrm{val}}_s(u) := \sup_{\sigma} \inf_{\tau} \mathbb{E}_s^{\sigma,\tau}[u] \le \inf_{\tau} \sup_{\sigma} \mathbb{E}_s^{\sigma,\tau}[u] =: \overline{\mathrm{val}}_s(u)$

but when these two quantities are equal then the state $s$ is said to have the value $\mathrm{val}_s(u) = \underline{\mathrm{val}}_s(u) =
\overline{\mathrm{val}}_s(u)$, denoted also $\mathrm{val}_s(u, \mathcal{A})$ whenever mentioning explicitly the arena is needed. Under the hypothesis
that $u$ is measurable and bounded, Martin's theorem [11] guarantees that every state has a value.
Notice however that Martin's theorem does not guarantee the existence of optimal strategies.

3 Discounted Games 

Arenas for discounted games are equipped with two mappings defined on the set $S$ of states. The discount
mapping

$\lambda : S \to [0, 1)$

associates with each state $s$ a discount factor $\lambda(s) \in [0, 1)$ and the reward mapping

$r : S \to \mathbb{R}$ (4)

maps each state $s$ to a real valued reward $r(s)$.
The payoff

$u_\lambda : (SA)^\omega \to \mathbb{R}$

for discounted games is calculated in the following way. For each play $p = s_0 a_0 s_1 a_1 s_2 a_2 \ldots \in (SA)^\omega$,

$u_\lambda(p) = (1 - \lambda(s_0)) r(s_0) + \lambda(s_0)(1 - \lambda(s_1)) r(s_1) + \lambda(s_0)\lambda(s_1)(1 - \lambda(s_2)) r(s_2) + \ldots$

$\phantom{u_\lambda(p)} = \sum_{i=0}^{\infty} \lambda(s_0) \cdots \lambda(s_{i-1})(1 - \lambda(s_i)) r(s_i)$. (5)

Usually when discounted games are considered it is assumed that there is only one discount factor,
i.e. that there exists $\lambda \in [0,1)$ such that $\lambda(s) = \lambda$ for all $s \in S$. But for us it is essential that the discount
factor depends on the state.
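
The partial sums of (5) are easy to compute along a finite prefix of a play. The following sketch is only an illustration: the function name `discounted_payoff_prefix` and the sample discount and reward tables are assumptions, not notation from the paper. Besides the partial sum it returns the product $\lambda(s_0) \cdots \lambda(s_n)$, which bounds the weight of everything still to come.

```python
from fractions import Fraction


def discounted_payoff_prefix(states, discount, reward):
    """Partial sum of (5) over the visited states s_0, ..., s_n.

    Returns (partial_sum, product), where product = lambda(s_0)...lambda(s_n)
    is the total weight left for the remaining stages.
    """
    total = 0
    product = 1            # lambda(s_0) ... lambda(s_{i-1}), empty product at i = 0
    for s in states:
        total += product * (1 - discount[s]) * reward[s]
        product *= discount[s]
    return total, product


# Hypothetical state-dependent discount factors and rewards.
LAMBDA = {"a": Fraction(1, 2), "b": Fraction(3, 4)}
REWARD = {"a": 2, "b": 4}
```

With a constant reward $c$ the partial sum equals $c \cdot (1 - \lambda(s_0) \cdots \lambda(s_n))$, which shows why the factors $(1 - \lambda(s_i))$ normalize the payoff to the range of the rewards.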

Shapley [13] proved$^1$ that

Theorem 2 (Shapley). Discounted games $(\mathcal{A}, u_\lambda)$ over finite arenas admit optimal deterministic
memoryless strategies for both players.

3.1 Interpretations of discounted games 

The rather obscure formula (5) can be interpreted in several ways. The usual economic interpretation
is the following. The reward $r(s)$ represents the payoff that player Max receives if the state $s$ is visited.
But a given sum of money is worth more now than in the future: visiting $s_i$ at stage $i$ is worth
$\lambda(s_0) \cdots \lambda(s_{i-1}) r(s_i)$ rather than $r(s_i)$ (visiting $s_i$ is worth $r(s_i)$ only the first day). With this
interpretation $\sum_{i=0}^{\infty} \lambda(s_0) \cdots \lambda(s_{i-1}) r(s_i)$ represents the accumulated total payoff that player Max receives
during an infinite play. However, with this interpretation it is difficult to assign a meaning to the factors
$(1 - \lambda(s_i))$, and such factors are essential when we consider the limit of $u_\lambda$ with discount factors tending
to 1.

In his seminal paper [13] Shapley gives another interpretation of (5) in terms of stopping games. Suppose
that at a stage $i$ a state $s_i$ is visited. Then with probability $1 - \lambda(s_i)$ nature can stop the game.
Since we have assumed that $0 \le \lambda(s) < 1$ for all $s \in S$, the stopping probabilities are strictly positive,
which implies that the game will eventually stop with probability 1 after a finite number of steps.

If the game stops in $s_i$ then player Max receives from player Min the payment $r(s_i)$ and this ends
the game. Thus here player Max receives the payoff only once, when the game stops, and the payoff is
determined by the last state.

If the game does not stop in $s_i$ then there is no payment at this stage and the player controlling the
state $s_i$ chooses an action to execute.

Note that $\lambda(s_0) \cdots \lambda(s_{i-1})(1 - \lambda(s_i))$ gives the probability that the game has not stopped in any of
the states $s_0, \ldots, s_{i-1}$ but it does stop in the state $s_i$. Since this event results in the payment $r(s_i)$, (5)
represents in this interpretation the payoff expectation for an infinite play $s_0 a_0 s_1 a_1 s_2 a_2 \cdots$ during the
stopping game.
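
The stopping-game reading can be checked numerically on a finite prefix: the stopping probabilities along the prefix, together with the probability of not having stopped yet, form a probability distribution, and the expected payment is a partial sum of (5). The sketch below assumes a hypothetical helper `stopping_distribution` and hypothetical sample data.

```python
from fractions import Fraction


def stopping_distribution(states, discount):
    """Stopping probabilities along the visited states s_0, ..., s_n.

    Returns (probs, alive): probs[i] is the probability that the game stops
    exactly in s_i, alive is the probability that it has not stopped by s_n.
    """
    probs = []
    alive = 1                       # probability of surviving s_0, ..., s_{i-1}
    for s in states:
        probs.append(alive * (1 - discount[s]))
        alive *= discount[s]
    return probs, alive


def expected_payment(states, discount, reward):
    """Expected payment received if the game stops within the given prefix."""
    probs, _ = stopping_distribution(states, discount)
    return sum(p * reward[s] for p, s in zip(probs, states))


# Hypothetical data.
LAMBDA = {"a": Fraction(1, 2), "b": Fraction(3, 4)}
REWARD = {"a": 2, "b": 4}
```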

Another related interpretation, making a direct link between discounted games and mean-payoff
games, is the following. We transform the discounted arena $\mathcal{A}$ into a new arena $\mathcal{A}^*$ by attaching to
each state $s \in S$ a new state $s^*$. We set $r(s^*) = r(s)$, i.e. each new adjoined state has the same reward as
the corresponding original state.

In the new arena $\mathcal{A}^*$ we incorporate the discount factors directly into the transition probabilities.
Recall that, for each state $s \in S$ of the original arena, $\delta(s,a,s')$ was the probability of going to a state
$s'$ if an action $a$ is executed in $s$. In the new arena $\mathcal{A}^*$ this probability is set to $\delta^*(s,a,s') = \lambda(s)\delta(s,a,s')$.
On the other hand we set also $\delta^*(s, a, s^*) = 1 - \lambda(s)$, i.e. in $\mathcal{A}^*$ with probability $1 - \lambda(s)$ the execution
of $a$ in $s$ leads to $s^*$ (note that for fixed $a$ the probabilities sum up to 1).

Each new state $s^*$ is absorbing: there is only one action available in each $s^*$, we denote it $*$, and this
action leads with probability 1 back to $s^*$. This situation is illustrated by the following picture.



1 In fact, Shapley considered a much larger class of stochastic games. For these games he proved that both players have
memoryless optimal strategies. For perfect information games his proof yields optimal strategies that are also deterministic.

We consider the mean-payoff game played on $\mathcal{A}^*$, i.e. the game with the payoff $u_r(s_0 a_0 s_1 a_1 \cdots) =
\limsup_k \frac{1}{k+1} \sum_{i=0}^{k} r(s_i)$. Such a game played on $\mathcal{A}^*$ ends with probability 1 in one of the starred states
$s^*$ and then the mean-payoff is simply $r(s^*) = r(s)$. Intuitively, stopping in $s$ with the payoff $r(s)$ in the
stopping game is the same as going to $s^*$ and looping there infinitely with the same mean-payoff $r(s^*)$.
Thus a discounted game can be seen as a mean-payoff game played on an arena where with probability
1 we end in some absorbing state. If discount factors tend to 1 then this means that, intuitively, we cut
off the absorbing starred states of $\mathcal{A}^*$.
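
The passage from $\mathcal{A}$ to $\mathcal{A}^*$ is mechanical and can be sketched in a few lines. In the sketch below the starred copy of a state `s` is encoded as the pair `(s, "*")`; this encoding, the function name `star_arena` and the sample data are assumptions chosen for illustration.

```python
from fractions import Fraction


def star_arena(states, actions, delta, discount):
    """Transition table delta* of the arena obtained by adding absorbing states s*."""
    delta_star = {}
    for s in states:
        s_star = (s, "*")                      # hypothetical encoding of s*
        for a in actions[s]:
            # delta*(s, a, s') = lambda(s) * delta(s, a, s')
            row = {t: discount[s] * p for t, p in delta[(s, a)].items()}
            # delta*(s, a, s*) = 1 - lambda(s)
            row[s_star] = 1 - discount[s]
            delta_star[(s, a)] = row
        # s* is absorbing: its only action "*" loops back with probability 1
        delta_star[(s_star, "*")] = {s_star: Fraction(1)}
    return delta_star


# Hypothetical two-state arena with state-dependent discount factors.
STATES = ["s", "t"]
ACTIONS = {"s": ("a", "b"), "t": ("c",)}
DELTA = {
    ("s", "a"): {"s": Fraction(1, 2), "t": Fraction(1, 2)},
    ("s", "b"): {"t": Fraction(1)},
    ("t", "c"): {"s": Fraction(1)},
}
LAMBDA = {"s": Fraction(1, 2), "t": Fraction(3, 4)}
DELTA_STAR = star_arena(STATES, ACTIONS, DELTA, LAMBDA)
```

Every row of the resulting table is again a probability distribution, which is the remark made in the text that for a fixed action the probabilities sum up to 1.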

4 Blackwell optimality 

We will consider what happens if the discount factors tend to 1. The novelty in comparison with the 
traditional approach is that we consider the situation where discount factors of different states tend to 1 
with different rates. 

A rational discount parametrization is a family of mappings $\lambda_t = (\lambda_t(s))_{s \in S}$ such that, for each state $s$,

• $t \mapsto \lambda_t(s)$ is a rational$^2$ mapping of $t$,

• there exists $0 < \varepsilon < 1$ such that $\lambda_t(s) \in [0, 1)$ for all $t \in [1 - \varepsilon, 1)$ (note that since the set of states
is finite we can choose the same $\varepsilon$ for all states),

• $\lim_{t \uparrow 1} \lambda_t(s) = 1$.

A typical example of a rational parametrization is the canonical rational discount parametrization
defined in the following way. For each state $s$ we fix a natural number $\pi(s) \in \mathbb{N}$ called the priority of $s$
and a positive real number $w(s) \in (0, \infty)$ called the weight of $s$. Then the canonical parametrization is
defined as

$\lambda_t(s) = 1 - w(s)(1 - t)^{\pi(s)}$, for $s \in S$, $t \in \mathbb{R}$. (6)

We will consider discounted games where discount factors are given by a rational discount parametrization.
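
As a small illustration of (6), the sketch below evaluates the canonical parametrization for a given $t$. The function name `canonical_discount` and the sample weights and priorities are assumptions. The example also shows the "different rates" phenomenon: the quotient $(1 - \lambda_t(b)) / (1 - \lambda_t(a))$ equals $\frac{w(b)}{w(a)} (1-t)^{\pi(b) - \pi(a)}$, so for $\pi(b) > \pi(a)$ the discount factor of $b$ approaches 1 faster than that of $a$.

```python
from fractions import Fraction


def canonical_discount(t, weight, priority):
    """Canonical parametrization (6): lambda_t(s) = 1 - w(s) * (1 - t) ** pi(s)."""
    return {s: 1 - weight[s] * (1 - t) ** priority[s] for s in weight}


# Hypothetical weighted priority system on two states.
WEIGHT = {"a": 1, "b": 2}
PRIORITY = {"a": 1, "b": 2}


def rate_quotient(t):
    """(1 - lambda_t(b)) / (1 - lambda_t(a)) for the sample data above."""
    lam = canonical_discount(t, WEIGHT, PRIORITY)
    return (1 - lam["b"]) / (1 - lam["a"])
```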

Theorem 3 (Blackwell optimality). Let us fix an arena $\mathcal{A}$ of a discounted game and let $\lambda_t$ be a rational
discount parametrization for $\mathcal{A}$. Let $\mathrm{val}_s(u_{\lambda_t})$ be the value of a state $s \in S$ for $\lambda_t$ in the game $(\mathcal{A}, u_{\lambda_t})$.
Then there exists $0 < \varepsilon < 1$ such that, for each state $s$,

(1) for $t \in (1 - \varepsilon, 1)$, $t \mapsto \mathrm{val}_s(u_{\lambda_t})$ is a rational function of $t$ and

(2) if $\sigma^\sharp$ and $\tau^\sharp$ are optimal deterministic memoryless strategies for some $t \in (1 - \varepsilon, 1)$ then $\sigma^\sharp$ and $\tau^\sharp$
are optimal for all $t \in (1 - \varepsilon, 1)$.

2 Rational in the sense that $\lambda_t(s)$ is a quotient of two polynomials in $t$.

In the sequel we call strategies $\sigma^\sharp$ and $\tau^\sharp$ Blackwell optimal for a rational discount parametrization
$\lambda_t$ if $\sigma^\sharp$ and $\tau^\sharp$ are deterministic memoryless strategies satisfying part (2) of Theorem 3.

Let us note that Theorem 3 exhibits a curious property of discounted games discovered by Blackwell [3].$^3$
By Theorem 2 we know that for each fixed $t$ the discounted game with payoff $u_{\lambda_t}$ has optimal
memoryless deterministic strategies, but obviously such strategies depend on $t$. Theorem 3 asserts that
for $t \in (1 - \varepsilon, 1)$ the situation stabilizes and optimal deterministic memoryless strategies do not depend
on $t$. Since Blackwell optimality is usually proved only for Markov decision processes with a unique discount
factor for all states, see [10] for example, we decided to include the complete proof of Theorem 3.
Note however that our proof follows closely the one used for Markov decision processes.

The proof of Theorem [3] is based on the following lemma that will be useful also in the next section. 

Lemma 4. Let $t \mapsto \lambda_t$ be a rational discount parametrization and let $\sigma$, $\tau$ be deterministic memoryless
strategies. Then, for each state $s$, and for $t$ sufficiently close to 1, $\mathbb{E}_s^{\sigma,\tau}[u_{\lambda_t}]$ is a rational function of $t$.

Proof. The proof is standard but we give it for the sake of completeness. The set $\mathbb{R}^{S \times S}$ of functions from
$S \times S$ into real numbers can be seen as the set of square real valued matrices with rows and columns
indexed by $S$. In particular $\mathbb{R}^{S \times S}$ is a vector space with natural matrix addition and scalar multiplication.
However, matrix multiplication defines also a product on $\mathbb{R}^{S \times S}$: for $M, N \in \mathbb{R}^{S \times S}$, $MN$ is an element
$U$ of $\mathbb{R}^{S \times S}$ with entries $U[s', s''] = \sum_{s \in S} M[s', s] N[s, s'']$. We endow $\mathbb{R}^{S \times S}$ with a norm: for $M \in \mathbb{R}^{S \times S}$,
$\|M\| = \max_{s' \in S} \sum_{s'' \in S} |M[s', s'']|$. It can be easily shown that $\|MN\| \le \|M\| \cdot \|N\|$ for $M, N \in \mathbb{R}^{S \times S}$ and
that $\mathbb{R}^{S \times S}$ is a complete metric space for the metric induced by the norm $\|\cdot\|$, see Section 3.2.1 of [15] for a
proof.

On the other hand, we consider also the vector space $\mathbb{R}^S$ of functions from $S$ into $\mathbb{R}$; they can be
seen as column vectors indexed by states. Of course if $M \in \mathbb{R}^{S \times S}$ and $v \in \mathbb{R}^S$ then $Mv \in \mathbb{R}^S$, where
$(Mv)[s] = \sum_{s' \in S} M[s, s'] v[s']$ for $s \in S$.

We equip $\mathbb{R}^S$ with a norm: for $v \in \mathbb{R}^S$, $\|v\|_\infty = \max_{s \in S} |v[s]|$. The norms on $\mathbb{R}^{S \times S}$ and $\mathbb{R}^S$ are
compatible in the sense that we have $\|Mv\|_\infty \le \|M\| \cdot \|v\|_\infty$.

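Let $\sigma, \tau$ be deterministic memoryless strategies for players Max and Min and let $\lambda_t$ be a rational
discount parametrization. We define, for $s', s'' \in S$,

$\delta(s', s'') = \begin{cases} \delta(s', \sigma(s'), s'') & \text{if } s' \in S_{\mathrm{Max}} \\ \delta(s', \tau(s'), s'') & \text{if } s' \in S_{\mathrm{Min}} \end{cases}$

where $\sigma(s')$ and $\tau(s')$ denote the unique actions chosen in $s'$ by the deterministic memoryless strategies.
Thus $\delta$ defines transition probabilities of the Markov chain obtained when we fix the strategies $\sigma$ and $\tau$.
In the sequel $M$ will denote the element of $\mathbb{R}^{S \times S}$ defined in the following way:

$M[s', s''] = \lambda_t(s') \delta(s', s'')$, for $s', s'' \in S$. (7)

Let $I \in \mathbb{R}^{S \times S}$ be the identity matrix, i.e. $I[s', s'']$ is 1 if $s' = s''$ and 0 otherwise.
We shall show that for $t$ close to 1 the matrix $(I - M)$ is invertible and

$(I - M)^{-1} = \sum_{i=0}^{\infty} M^i$. (8)

First we show that the series on the right-hand side of (8) converges.

3 In fact Blackwell [3] considered only one-player games with the same discount factor for all states.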
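Let $\lambda_m = \max_{s \in S} \lambda_t(s)$. Then for $t$ sufficiently close to 1 we have $\|M\| \le \lambda_m < 1$ and, for $k < l$,

$\left\| \sum_{i=k}^{l} M^i \right\| \le \sum_{i=k}^{l} \|M\|^i \le \sum_{i=k}^{l} \lambda_m^i \le \frac{\lambda_m^k}{1 - \lambda_m} \xrightarrow[k,l \to \infty]{} 0$

since, by the definition of a rational discount parametrization, $0 \le \lambda_m < 1$ for $t$ sufficiently close to 1.
Thus the series $\sum_{i=0}^{\infty} M^i$ satisfies the Cauchy condition and the convergence follows from the completeness
of $\mathbb{R}^{S \times S}$ for the norm $\|\cdot\|$. Now it suffices to note that

$\left\| (I - M) \cdot \sum_{i=0}^{k} M^i - I \right\| = \| M^{k+1} \|$ and $\| M^{k+1} \| \le \|M\|^{k+1} \le \lambda_m^{k+1} \xrightarrow[k \to \infty]{} 0$,

which yields (8).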



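Let $(S_i)_{i=0}^{\infty}$ be the stochastic process giving the state at stage $i$. Then

$\mathbb{E}_s^{\sigma,\tau}[u_{\lambda_t}] = \mathbb{E}_s^{\sigma,\tau}\left[ \sum_{i=0}^{\infty} \lambda_t(S_0) \cdots \lambda_t(S_{i-1})(1 - \lambda_t(S_i)) r(S_i) \right]
= \lim_{k \to \infty} \mathbb{E}_s^{\sigma,\tau}\left[ \sum_{i=0}^{k} \lambda_t(S_0) \cdots \lambda_t(S_{i-1})(1 - \lambda_t(S_i)) r(S_i) \right]$ (9)

where the second equality follows from the Lebesgue dominated convergence theorem.
Let $v$ be an element of $\mathbb{R}^S$ defined as

$v[s] = (1 - \lambda_t(s)) r(s)$, for $s \in S$.

An elementary induction on $i$ shows that, for $s, s' \in S$,

$\mathbb{E}_s^{\sigma,\tau}\left[ \lambda_t(S_0) \cdots \lambda_t(S_{i-1}) \cdot \mathbf{1}_{\{S_i = s'\}} \right] = M^i[s, s']$,

i.e. the entry $[s, s']$ of the $i$-th power of $M$ is the expectation of $\lambda_t(S_0) \cdots \lambda_t(S_{i-1})$ restricted to the event
$S_i = s'$, for the chain started in $S_0 = s$. This yields

$(M^i v)[s] = \sum_{s' \in S} M^i[s, s'] \cdot v[s'] = \sum_{s' \in S} \mathbb{E}_s^{\sigma,\tau}\left[ \lambda_t(S_0) \cdots \lambda_t(S_{i-1}) \cdot \mathbf{1}_{\{S_i = s'\}} \right] \cdot (1 - \lambda_t(s')) r(s')
= \mathbb{E}_s^{\sigma,\tau}\left[ \lambda_t(S_0) \cdots \lambda_t(S_{i-1})(1 - \lambda_t(S_i)) r(S_i) \right]$. (10)

Taking the sum from $i = 0$ to $k$ on both sides of (10) and next the limit with $k$ tending to infinity, using
(8) and (9), we obtain

$\mathbb{E}_s^{\sigma,\tau}[u_{\lambda_t}] = \left( (I - M)^{-1} v \right)[s]$.

But the elements of the matrix $I - M$ are rational functions of $t$, thus Cramer's rule for matrix
inversion shows that $(I - M)^{-1}$ has also rational elements, and since the elements of $v$ are also rational
functions we can see that $\mathbb{E}_s^{\sigma,\tau}[u_{\lambda_t}]$ is a rational function of $t$.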

□ 
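
The identity $\mathbb{E}_s^{\sigma,\tau}[u_{\lambda_t}] = ((I - M)^{-1} v)[s]$ obtained in the proof can be evaluated exactly for small chains. The sketch below is an illustration under stated assumptions: the helper names `solve` and `expected_discounted_payoff` are hypothetical, the Markov chain is given directly (strategies already fixed), and a plain Gauss-Jordan elimination over exact fractions stands in for Cramer's rule.

```python
from fractions import Fraction


def solve(matrix, rhs):
    """Solve matrix * x = rhs exactly (square, invertible matrix assumed)."""
    n = len(rhs)
    aug = [list(map(Fraction, row)) + [Fraction(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # pick a row with a non-zero pivot and move it into place
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        head = aug[col][col]
        aug[col] = [x / head for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def expected_discounted_payoff(states, chain, discount, reward):
    """Return {s: ((I - M)^{-1} v)[s]} with M and v as in (7) and the proof.

    chain[s][s2] is the probability of moving from s to s2 once the
    deterministic memoryless strategies have been fixed.
    """
    i_minus_m = [[(1 if s == s2 else 0) - discount[s] * chain[s].get(s2, 0)
                  for s2 in states] for s in states]
    v = [(1 - discount[s]) * reward[s] for s in states]
    return dict(zip(states, solve(i_minus_m, v)))


# Hypothetical chain alternating between "a" (reward 0) and "b" (reward 1).
HALF = Fraction(1, 2)
VALUES = expected_discounted_payoff(
    ["a", "b"], {"a": {"b": 1}, "b": {"a": 1}},
    {"a": HALF, "b": HALF}, {"a": 0, "b": 1},
)
```

For the alternating chain the result can be cross-checked with the stopping interpretation: starting in `a`, the game stops in `b` with probability $1/4 + 1/16 + \ldots = 1/3$.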






Blackwell optimal strategies 



Proof of Theorem 3. According to Lemma 4 and since discounted games admit optimal deterministic
memoryless strategies, (1) is a consequence of (2).
We prove (2) as follows.

Let $X$ be the set of all tuples $(q, \sigma, \tau, \sigma', \tau')$, where $q$ is a state, $\sigma, \sigma'$ are deterministic memoryless
strategies for player Max and $\tau, \tau'$ are deterministic memoryless strategies for player Min. Note that for
finite arenas $X$ is finite. Let $\lambda_t$ be a rational discount parametrization and let $0 < \varepsilon < 1$ be such that
$\lambda_t(s) \in [0, 1)$ for all states $s$ and all $t \in (1 - \varepsilon, 1)$.

For each $(q, \sigma, \tau, \sigma', \tau') \in X$ we consider the function $\Phi_{q,\sigma,\tau,\sigma',\tau'} : (1 - \varepsilon, 1) \to \mathbb{R}$ defined by:

$t \mapsto \Phi_{q,\sigma,\tau,\sigma',\tau'}(t) = \mathbb{E}_q^{\sigma,\tau}[u_{\lambda_t}] - \mathbb{E}_q^{\sigma',\tau'}[u_{\lambda_t}]$.

According to Lemma 4, $\Phi_{q,\sigma,\tau,\sigma',\tau'}(t)$ is a rational function of $t$ for $t$ sufficiently close to 1. Since
a rational function can change the sign (cross the $x$-axis) only finitely many times, there exists $\varepsilon_1 =
\varepsilon_1(q, \sigma, \tau, \sigma', \tau') > 0$ such that the sign of $\Phi_{q,\sigma,\tau,\sigma',\tau'}(t)$ does not change in the interval $(1 - \varepsilon_1, 1)$. Let
$\varepsilon_2 = \min(\{\varepsilon\} \cup \{\varepsilon_1(q, \sigma, \tau, \sigma', \tau') : (q, \sigma, \tau, \sigma', \tau') \in X\})$.

Since $X$ is finite the minimum on the right is taken over a finite set of positive numbers and we
conclude that $\varepsilon_2 > 0$.

Let us take any $t \in (1 - \varepsilon_2, 1)$. Let $\sigma^\sharp, \tau^\sharp$ be optimal deterministic memoryless strategies in the
discounted game $(\mathcal{A}, u_{\lambda_t})$ (Theorem 2). Then, in particular, we have

$\mathbb{E}_s^{\sigma,\tau^\sharp}[u_{\lambda_t}] \le \mathbb{E}_s^{\sigma^\sharp,\tau^\sharp}[u_{\lambda_t}] \le \mathbb{E}_s^{\sigma^\sharp,\tau}[u_{\lambda_t}]$ (11)

for all deterministic memoryless strategies $\sigma, \tau$. We can rewrite (11) as $\Phi_{s,\sigma^\sharp,\tau^\sharp,\sigma,\tau^\sharp}(t) \ge 0$ and
$\Phi_{s,\sigma^\sharp,\tau,\sigma^\sharp,\tau^\sharp}(t) \ge 0$. However if these inequalities hold for some $t \in (1 - \varepsilon_2, 1)$ then we have seen that
they hold for all $t \in (1 - \varepsilon_2, 1)$. Therefore (11) holds for all $t \in (1 - \varepsilon_2, 1)$. Finally Theorem 2 implies that
if (11) holds for all deterministic memoryless strategies $\sigma$ and $\tau$ (with fixed deterministic memoryless
$\sigma^\sharp$ and $\tau^\sharp$) then it holds for all strategies $\sigma, \tau$.$^4$ □
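
The stabilization asserted by Theorem 3 can be observed on a toy instance. The sketch below is not the proof technique: it simply enumerates the deterministic memoryless strategies of a hypothetical one-player arena with deterministic transitions and a single discount factor $\lambda_t(s) = t$ (the canonical parametrization with $w = 1$ and $\pi = 1$), and evaluates each strategy by iterating $x \leftarrow v + Mx$. All names and numbers are assumptions made for illustration. In state `s` Max either moves to an absorbing state of reward 1 (action `x`) or collects a reward 4 once before being absorbed in a state of reward 0 (action `y`); comparing $t$ with $4t(1-t)$ shows that `y` is better for $t < 3/4$ and `x` for $t > 3/4$.

```python
from itertools import product


def evaluate(policy, succ, discount, reward, sweeps=5000):
    """Approximate the discounted payoff of a policy by iterating x <- v + M x.

    succ[(s, a)] is the unique successor of s under action a (deterministic
    transitions are an assumption of this toy example).
    """
    x = {s: 0.0 for s in reward}
    for _ in range(sweeps):
        x = {s: (1 - discount[s]) * reward[s]
                + discount[s] * x[succ[(s, policy[s])]]
             for s in reward}
    return x


def best_first_action(t, actions, succ, reward, start):
    """Action chosen in `start` by a best deterministic memoryless strategy."""
    discount = {s: t for s in reward}          # lambda_t(s) = t for every state
    states = sorted(actions)
    scored = []
    for choice in product(*(actions[s] for s in states)):
        policy = dict(zip(states, choice))
        value = evaluate(policy, succ, discount, reward)[start]
        scored.append((value, policy[start]))
    return max(scored)[1]


# Hypothetical one-player arena (all states controlled by Max).
ACTIONS = {"s": ("x", "y"), "x": ("loop",), "y1": ("go",), "y2": ("loop",)}
SUCC = {("s", "x"): "x", ("s", "y"): "y1", ("x", "loop"): "x",
        ("y1", "go"): "y2", ("y2", "loop"): "y2"}
REWARD = {"s": 0, "x": 1, "y1": 4, "y2": 0}
```

In this instance the optimal choice changes once, at $t = 3/4$, and then no longer depends on $t$, which is the behaviour that part (2) of the theorem guarantees for $t$ close enough to 1.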



5 Priority mean-payoff games 

In mean-payoff games the players try to optimize (maximize/minimize) the mean value of the payoff
received at each stage. In such games the reward mapping

$r : S \to \mathbb{R}$ (12)

gives, for each state $s$, the payoff received by player Max when $s$ is visited. The payoff of an infinite play
is defined as the limit of the means of daily payments:

$u_r(s_0 s_1 s_2 \cdots) = \limsup_k \frac{1}{k+1} \sum_{i=0}^{k} r(s_i)$, (13)

where we take limsup rather than the simple limit since the latter may not exist.

We slightly generalize mean-payoff games by equipping arenas with a new mapping

$w : S \to \mathbb{R}_+$ (14)

associating with each state $s$ a strictly positive real number $w(s)$, the weight of $s$. We can interpret $w(s)$
as the amount of time spent in state $s$ upon each visit to $s$. In this setting $r(s)$ should be seen as the payoff
per time unit when $s$ is visited, thus the weighted mean payoff received by player Max is

$u_{r,w}(s_0 s_1 s_2 \cdots) = \limsup_k \frac{\sum_{i=0}^{k} w(s_i) r(s_i)}{\sum_{i=0}^{k} w(s_i)}$. (15)

Note that in the special case when the weights are all equal to 1, the weighted mean value (15) reduces
to (13).

4 In other words, for discounted games being optimal in the class of memoryless deterministic strategies implies being
optimal in the class of all strategies.
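
For a finite prefix $s_0 \ldots s_k$ the quotient under the limsup in (15) is straightforward to compute. The following sketch, with the hypothetical function name `weighted_mean_prefix` and made-up data, also checks the remark that unit weights give back the ordinary mean of (13).

```python
from fractions import Fraction


def weighted_mean_prefix(states, reward, weight):
    """Quotient inside the limsup of (15) for the prefix s_0 ... s_k."""
    numerator = sum(Fraction(weight[s]) * reward[s] for s in states)
    denominator = sum(Fraction(weight[s]) for s in states)
    return numerator / denominator


# Hypothetical rewards and weights.
REWARD = {"a": 3, "b": 0}
UNIT = {"a": 1, "b": 1}
WEIGHT = {"a": 2, "b": 1}
```

With the weights above, a visit to `a` counts as two time units, so the prefix `a b b` yields $6/4$ instead of the plain mean $3/3$.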

As a final ingredient we add to the arena a priority mapping

$\pi : S \to \mathbb{N}$ (16)

assigning to each state $s$ a positive integer priority $\pi(s)$.

We define the priority of a play $p = s_0 a_0 s_1 a_1 s_2 a_2 \ldots$ as the smallest priority appearing infinitely often
in the sequence $\pi(s_0) \pi(s_1) \pi(s_2) \ldots$ of priorities visited in $p$:

$\pi(p) = \liminf_i \pi(s_i)$. (17)

For any priority $a$, let $\mathbf{1}_a : S \to \{0, 1\}$ be the indicator function of the set $\{s \in S \mid \pi(s) = a\}$, i.e.

$\mathbf{1}_a(s) = \begin{cases} 1 & \text{if } \pi(s) = a \\ 0 & \text{otherwise.} \end{cases}$ (18)

Then the priority mean-payoff of a play $p = s_0 a_0 s_1 a_1 s_2 a_2 \ldots$ is defined as

$u_{r,w,\pi}(p) = \limsup_k \frac{\sum_{i=0}^{k} \mathbf{1}_{\pi(p)}(s_i) \cdot w(s_i) \cdot r(s_i)}{\sum_{i=0}^{k} \mathbf{1}_{\pi(p)}(s_i) \cdot w(s_i)}$. (19)

In other words, to calculate the priority mean payoff $u_{r,w,\pi}(p)$ we take the weighted mean payoff but with the
weights of all states having priorities different from $\pi(p)$ shrunk to 0. (Let us note that the denominator
$\sum_{i=0}^{k} \mathbf{1}_{\pi(p)}(s_i) \cdot w(s_i)$ is different from 0 for $k$ large enough, in fact it tends to infinity since $\mathbf{1}_{\pi(p)}(s_i) = 1$
for infinitely many $i$. For small $k$ the numerator and the denominator can be equal to 0 and then, to avoid
all misunderstanding, it is convenient to assume that the indefinite value 0/0 is equal to $-\infty$.)
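
For an ultimately periodic play, i.e. a finite prefix followed by a cycle repeated forever, the limsup in (19) is an ordinary limit and depends only on the cycle: the priority of the play is the smallest priority on the cycle, and the payoff is the weighted mean of the rewards of the cycle states having that priority. The sketch below computes this value; restricting attention to such plays, as well as the function name and the sample data, are assumptions made for illustration.

```python
from fractions import Fraction


def priority_mean_payoff_lasso(cycle, reward, weight, priority):
    """Priority mean-payoff (19) of a play that eventually repeats `cycle`.

    The finite prefix before the cycle does not influence the value,
    so it is not passed to the function.
    """
    play_priority = min(priority[s] for s in cycle)          # equation (17)
    kept = [s for s in cycle if priority[s] == play_priority]
    numerator = sum(Fraction(weight[s]) * reward[s] for s in kept)
    denominator = sum(Fraction(weight[s]) for s in kept)
    return numerator / denominator


# Hypothetical data: unit weights, reward 1 on even priority and 0 on odd
# priority, which is the parity-game special case.
PRIORITY = {"a": 2, "b": 3, "c": 3}
WEIGHT = {"a": 1, "b": 1, "c": 1}
PARITY_REWARD = {s: 1 if PRIORITY[s] % 2 == 0 else 0 for s in PRIORITY}
```

With these parity-style rewards the payoff of a lasso is 1 exactly when the smallest priority on its cycle is even.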

In the sequel the couple (w, n) consisting of a weight mapping and a priority mapping will be called 
a weighted priority system. 

Let us note that priority mean-payoff games are a vast generalization of parity games. In fact parity games correspond to a very particular case of priority mean-payoff games: we recover the usual parity games when we set, for each state $s$, $w(s) = 1$, and $r(s) = 1$ if $\pi(s)$ is even and $r(s) = 0$ if $\pi(s)$ is odd.
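
The correspondence can be checked by a short computation, given here as an illustrative derivation rather than as part of the original argument. With unit weights, every state counted by the indicator $\mathbf{1}_{\pi(p)}$ has priority $\pi(p)$ and therefore the same reward, so as soon as $k$ is large enough for the denominator to be non-zero the ratio in (19) is constant:

```latex
u_{r,w,\pi}(p)
  = \limsup_{k}
    \frac{\sum_{i=0}^{k} \mathbf{1}_{\pi(p)}(s_i)\, r(s_i)}
         {\sum_{i=0}^{k} \mathbf{1}_{\pi(p)}(s_i)}
  = \begin{cases}
      1 & \text{if } \pi(p) \text{ is even,} \\
      0 & \text{if } \pi(p) \text{ is odd.}
    \end{cases}
```

Player Max thus receives payoff 1 exactly when the smallest priority visited infinitely often is even.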

Theorem 5. Priority mean-payoff games over finite arenas admit optimal deterministic memoryless 
strategies for both players. 

Proof. The proof of Theorem [5] relies on the transfer theorem proved in (H. This theorem states the 
following: if a payoff function u admits optimal deterministic memoryless strategies in all one-player
perfect information stochastic games over finite arenas equipped with payoff u or −u, then all two-player
perfect information stochastic games over finite arenas with payoff u also have optimal deterministic
memoryless strategies for both players.
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Blackwell optimal strategies 



In [6], we proved that one-player games equipped with the payoff function $u_{r,w,\pi}$ have optimal deterministic memoryless strategies. It remains to prove the same for one-player games equipped with the payoff function $-u_{r,w,\pi}$:

$$-u_{r,w,\pi}(s_0 s_1 s_2 \cdots) = \liminf_{k} \frac{-\sum_{i=0}^{k} \mathbf{1}_{\pi(p)}(s_i) \cdot w(s_i) \cdot r(s_i)}{\sum_{i=0}^{k} \mathbf{1}_{\pi(p)}(s_i) \cdot w(s_i)} . \tag{20}$$

Let us denote by $-r$ the reward mapping defined by $(-r)(s) = -r(s)$. Then,

$$u_{-r,w,\pi}(s_0 s_1 s_2 \cdots) = \limsup_{k} \frac{-\sum_{i=0}^{k} \mathbf{1}_{\pi(p)}(s_i) \cdot w(s_i) \cdot r(s_i)}{\sum_{i=0}^{k} \mathbf{1}_{\pi(p)}(s_i) \cdot w(s_i)} . \tag{21}$$



The expected values of $-u_{r,w,\pi}$ and $u_{-r,w,\pi}$ coincide on Markov chains, because in a Markov chain the limsup in (21) is almost surely a limit; see the proof of Theorem 7, page 8 of [6]. Since for every play $p$ we have $-u_{r,w,\pi}(p) \le u_{-r,w,\pi}(p)$, this implies that in a one-player arena every deterministic memoryless strategy optimal for the payoff function $u_{-r,w,\pi}$ is optimal for the payoff function $-u_{r,w,\pi}$ as well, and these two games have the same values and the same deterministic memoryless optimal strategies. This completes the proof. □



6 From rationally parametrized discounted games to priority mean-payoff 
games 

6.1 Priority mean-payoff derived from rational discount parametrization 

The aim of this short subsection is to show how a rational discount parametrization induces in a canonical 
way a weighted priority system. 

Let $\lambda_t$ be a rational discount parametrization. The fact that $\lim_{t \uparrow 1}(1 - \lambda_t(s)) = 0$ implies that for each state $s$, the function $t \mapsto 1 - \lambda_t(s)$ factorizes as $g_s(t)(1 - t)^{\pi(s)}$, where $\pi(s) \in \mathbb{N}$ is a positive integer constant and $t \mapsto g_s(t)$ is a rational function such that $g_s(1) \neq 0$. Moreover, since $1 - \lambda_t(s)$ is positive for $t \in (1 - \varepsilon, 1)$, $g_s(t)$ is also positive in the same interval and, by continuity of $g_s(t)$, $g_s(1) > 0$.

Now, for each state $s$, take $\pi(s)$ defined above as the priority of $s$ and $w(s) := g_s(1)$ as the weight of $s$. We say that $(w, \pi)$ defined in this way is the weighted priority system derived from the rational discount parametrization $\lambda_t$.
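
The factorization above can be sketched in a few lines of Python. This is only an illustration under an assumed encoding that is not in the paper: $1 - \lambda_t(s)$ is given as a quotient $P(x)/Q(x)$ of polynomials in $x = 1 - t$, with coefficients listed from the lowest degree upwards and $Q(0) \neq 0$. The priority is then the order of vanishing of $P$ at $x = 0$, and the weight is the first non-zero coefficient of $P$ divided by $Q(0)$. The function name `derived_priority_and_weight` is hypothetical.

```python
from fractions import Fraction


def derived_priority_and_weight(numerator, denominator=(1,)):
    """Return (pi(s), w(s)) for 1 - lambda_t(s) = P(x) / Q(x), with x = 1 - t.

    Assumed encoding: `numerator` and `denominator` list the coefficients of
    P and Q from lowest to highest degree, and Q(0) != 0.
    """
    if denominator[0] == 0:
        raise ValueError("the denominator must not vanish at t = 1")
    # The priority is the smallest power of (1 - t) with a non-zero coefficient.
    for order, coeff in enumerate(numerator):
        if coeff != 0:
            break
    else:
        raise ValueError("1 - lambda_t(s) must not be identically zero")
    if order == 0:
        raise ValueError("1 - lambda_t(s) must tend to 0 when t tends to 1")
    # The weight is g_s(1), i.e. the value at x = 0 of P(x) / (x**order * Q(x)).
    weight = Fraction(coeff) / Fraction(denominator[0])
    if weight <= 0:
        raise ValueError("g_s(1) must be strictly positive")
    return order, weight


# Canonical parametrization with priority 2: 1 - lambda_t(s) = (1 - t)**2.
canonical = derived_priority_and_weight((0, 0, 1))
# Hypothetical example: 1 - lambda_t(s) = (3x**2 + x**3) / (2 + x).
example = derived_priority_and_weight((0, 0, 3, 1), (2, 1))
```

In the second call, $g_s(t) = (3 + x)/(2 + x)$ with $x = 1 - t$, so the derived weight is $g_s(1) = 3/2$ and the derived priority is 2.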



6.2 Limit of a discounted game 

The following theorem establishes a remarkable link between discounted games and weighted priority mean-payoff games. Roughly speaking, it shows that the latter are the limit of discounted games, not only in the sense of game values (part (a)): the optimality of strategies is also preserved in the limit (part (b)).

Theorem 6. Let $\mathcal{A}$ be a fixed arena and let $t \mapsto \lambda_t$ be a rational discount parametrization for $\mathcal{A}$. Let $(w, \pi)$ be the weighted priority system derived from $\lambda_t$. Finally let $\sigma^*$ and $\tau^*$ be deterministic memoryless Blackwell optimal strategies for the discounted game $(\mathcal{A}, u_{\lambda_t})$.
Then

(a) for each state $s$, $\lim_{t \uparrow 1} \mathrm{val}_s(u_{\lambda_t}) = \mathrm{val}_s(u_{r,w,\pi})$, where $\mathrm{val}_s(u_{\lambda_t})$ is the value of the game $(\mathcal{A}, u_{\lambda_t})$ and $\mathrm{val}_s(u_{r,w,\pi})$ is the value of the game $(\mathcal{A}, u_{r,w,\pi})$, and

(b) if $\sigma^*$ and $\tau^*$ are Blackwell optimal memoryless deterministic strategies for the discounted game $(\mathcal{A}, u_{\lambda_t})$ then $\sigma^*$ and $\tau^*$ are optimal for the priority mean-payoff game $(\mathcal{A}, u_{r,w,\pi})$.

Let us note that part (a) of Theorem 6 was proved in [6] but only for one-player games$^5$ (Markov decision processes).

However, in [6] we were unable to establish any result linking optimal strategies for discounted games with optimal strategies of weighted priority games. Thus the main achievement of the present paper is part (b) of Theorem 6.

The following result was proved in [6] (Theorem 7 in [6]):

Lemma 7. Let $\lambda_t$ be a rational discount parametrization and let $(w, \pi)$ be the derived weighted priority system. Then for each state $s$ and for all deterministic memoryless strategies $\sigma$, $\tau$:

$$\lim_{t \uparrow 1} E_s^{\sigma,\tau}[u_{\lambda_t}] = E_s^{\sigma,\tau}[u_{r,w,\pi}] .$$

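Proof of Theorem 6. We begin with part (b). Let $\sigma^*$, $\tau^*$ be Blackwell optimal deterministic memoryless strategies for $\lambda_t$. Let $\sigma$ and $\tau$ be any deterministic memoryless strategies of players Max and Min. Then, for each state $s$ and all $t$ sufficiently close to 1,

$$E_s^{\sigma,\tau^*}[u_{\lambda_t}] \le E_s^{\sigma^*,\tau^*}[u_{\lambda_t}] \le E_s^{\sigma^*,\tau}[u_{\lambda_t}] .$$

Taking the limit with $t \uparrow 1$ we get by Lemma 7

$$E_s^{\sigma,\tau^*}[u_{r,w,\pi}] \le E_s^{\sigma^*,\tau^*}[u_{r,w,\pi}] \le E_s^{\sigma^*,\tau}[u_{r,w,\pi}] ,$$

which shows that $\sigma^*$ and $\tau^*$ are optimal in the class of deterministic memoryless strategies. But Theorem 5 implies that for priority mean-payoff games strategies optimal in the class of deterministic memoryless strategies are optimal also when all strategies are allowed. This terminates the proof of (b).

Obviously (a) follows from (b) and from Lemma 7. □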



7 Optimal but not Blackwell optimal strategies 



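[Figure: two states. The left state $s_{\mathrm{Max}}$ has $\pi = 1$ and $r = 0$; the right state $s_{\mathrm{Min}}$ has $\pi = 2$ and $r = 1$. Each state has a self-loop labelled "top"; the action "right" leads from $s_{\mathrm{Max}}$ to $s_{\mathrm{Min}}$ and the action "left" leads from $s_{\mathrm{Min}}$ to $s_{\mathrm{Max}}$.]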



Figure 1 : A parity game. Player Max has two deterministic memoryless optimal strategies but only one 
of them is Blackwell optimal. 

Theorem 6 states that Blackwell optimal strategies are also optimal for priority mean-payoff games. The converse is not true: the notion of Blackwell optimal strategies is strictly more restrictive.



5 In fact, [6] shows that the convergence of game values holds not only for rational parametrizations but for any "reasonable" parametrization of discount factors.

We illustrate this with the game presented in Figure 1. Here we have two states $s_{\mathrm{Max}}$, $s_{\mathrm{Min}}$ controlled respectively by players Max and Min. Both states have the same weight 1, which is omitted. The left state $s_{\mathrm{Max}}$ has priority $\pi = 1$ and reward $r = 0$, the right state $s_{\mathrm{Min}}$ has priority $\pi = 2$ and reward $r = 1$, thus essentially this is the usual parity game with two priorities. Both players have two deterministic memoryless strategies. The optimal strategy for player Min is to take action "left". With this strategy state $s_{\mathrm{Max}}$ with priority 1 is visited infinitely often, and since this is the minimal priority in this game the resulting payoff will be 0 whatever the strategy of player Max. Player Max can play "top" or "right"; in both cases, if player Min uses the strategy described above, the payoff is 0, thus both strategies are optimal for Max.

Now let us consider the associated discounted game with the canonical parametrization. Thus the discount factor of $s_{\mathrm{Max}}$ is $\lambda_t(s_{\mathrm{Max}}) = 1 - (1 - t)^{\pi(s_{\mathrm{Max}})} = t$ while the discount factor for $s_{\mathrm{Min}}$ is $\lambda_t(s_{\mathrm{Min}}) = 1 - (1 - t)^{\pi(s_{\mathrm{Min}})} = 1 - (1 - t)^2$. For player Min the optimal strategy is still to always play "left". For player Max the strategies "right" and "top" are now different. For example, if we start from $s_{\mathrm{Max}}$ then playing "top" will result in payoff 0 since we will visit only the state $s_{\mathrm{Max}}$ with reward 0. On the other hand, playing "right" we will visit infinitely often the state $s_{\mathrm{Min}}$ with a positive reward; thus for discounted games playing "right" is strictly better for Max than playing "top", and the strategy where Max plays "right" is the only Blackwell optimal strategy.

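The comparison between "top" and "right" can be checked numerically. The sketch below is an illustration only: it assumes that the discounted payoff of a play has the form $\sum_i \lambda(s_0) \cdots \lambda(s_{i-1}) (1 - \lambda(s_i))\, r(s_i)$ (this definition is not restated in this section), that Min always plays "left", and that the play starts in $s_{\mathrm{Max}}$. The function name `discounted_value_at_smax` is hypothetical.

```python
from fractions import Fraction


def discounted_value_at_smax(t, max_action):
    """Discounted payoff from s_Max in the game of Figure 1 when Min plays "left".

    Assumption: a visit to state s contributes (1 - lambda(s)) * r(s) and the
    remainder of the play is discounted by lambda(s).
    """
    lam_max = t                     # 1 - (1 - t)**1, priority 1
    lam_min = 1 - (1 - t) ** 2      # priority 2
    reward_max, reward_min = Fraction(0), Fraction(1)
    if max_action == "top":
        # Self-loop on s_Max: V = (1 - lam) * r + lam * V, hence V = r.
        return reward_max
    # "right" then "left" alternate forever between the two states:
    #   V_max = (1 - lam_max) * r_max + lam_max * V_min
    #   V_min = (1 - lam_min) * r_min + lam_min * V_max
    first = (1 - lam_max) * reward_max
    second = (1 - lam_min) * reward_min
    return (first + lam_max * second) / (1 - lam_max * lam_min)


value_top = discounted_value_at_smax(Fraction(9, 10), "top")
value_right = discounted_value_at_smax(Fraction(9, 10), "right")
value_right_closer = discounted_value_at_smax(Fraction(99, 100), "right")
```

Under these assumptions, "right" yields a strictly positive value for every $t < 1$ while "top" yields 0, and the value of "right" shrinks towards 0 as $t$ approaches 1, in line with the priority mean-payoff value 0 of this game.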
The main motivation behind Blackwell optimal strategies comes from the following observation (due to Blackwell). Consider a mean-payoff game controlled completely by player Max and suppose that there are only two possible infinite plays. The first play begins with a long but finite sequence of rewards 0 followed by an infinite sequence of rewards 1. The mean payoff for such a history is 1; the initial sequence of rewards 0 does not count in the limit. Consider now the second play, which is an infinite sequence of rewards 1, without any 0. Here the mean payoff is also 1. Thus, under the mean-payoff criterion, player Max is indifferent between the two histories. But intuitively the second history is clearly better for Max than the first one: one prefers to have the reward 1 each day rather than to begin with the reward 0. This difference is captured by Blackwell optimality.
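
The gap between the two criteria can be made explicit with a short computation. As an illustration only, assume a single discount factor $\lambda \in (0, 1)$ for all states, the normalized discounted payoff $(1 - \lambda) \sum_i \lambda^i r_i$, and a first play that starts with $n \ge 1$ rewards 0 (the symbols $\lambda$ and $n$ are introduced here for the example):

```latex
\text{first play: } (1-\lambda) \sum_{i \ge n} \lambda^{i} = \lambda^{n} < 1,
\qquad
\text{second play: } (1-\lambda) \sum_{i \ge 0} \lambda^{i} = 1 .
```

Both values tend to 1 as $\lambda \uparrow 1$, in agreement with the common mean payoff, yet for every $\lambda < 1$ the second play is strictly better.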
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